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Abstract 


We propose information criteria that measure the prediction risk of a predictive 
density based on the Bayesian marginal likelihood from a frequentist point of view. 
We derive criteria for selecting variables in linear regression models, assuming a prior 
distribution of the regression coefficients. Then, we discuss the relationship between 
the proposed criteria and related criteria. There are three advantages of our method. 
First, this is a compromise between the frequentist and Bayesian standpoints because 
it evaluates the frequentist’s risk of the Bayesian model. Thus, it is less influenced by a 
prior misspecification. Second, the criteria exhibits consistency when selecting the true 
model. Third, when a uniform prior is assumed for the regression coeffici ents, t he re¬ 
sulting criterion is equivalent to the residual information criterion (RIC) of Shi and Tsai 

(H). 
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1 Introduction 


The problem of selecting appr opriate models has been studied extensively in the literature 
since the work of Akaike f 19731 . 19741) . who derived the so-called Akaike information criterion 
(AIC). There are several approach es to s olving the model selection problem: information 
criteria, such as the AIC or BIC (Schwarz, 19781) : shrinkage methods, such as the lasso 
(Tibshirani, 19961) : Bayesian techniques; among others. With regard to Bayesian techniques, 


O’Hara and Sillannaa ( 20091) provide a good revi ew of the key works in this fi e ld, includin g 


Kuo and Mallickl ( 1998 1. Dellaportas et al. ( 1997 1. and George and McCulloch ( 19931. 1997I) . 


In addition, a Bayesian lasso procedure based on a spike-and-slab prior has attracted much 
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recent attention ( Xu and Ghoshl . 12015 ). However, although these methods are useful and 
important, we focus on information criteria in this study. 

Two of the fundamental information criteria are the AIC and the BIC. Because the AIC 
and its variants are based on the risks of predictive densities with respect to the Kullback- 
Leibler (KL) divergence, they are able to select good models in terms of their predictive 


neioier fJvnj d ivergence , tney are a bl e to select good models m terms 01 tneir predictive 
ability. In fact, Shibata ( 198lh . Shaol f 1997fk and others have shown that the model selected 
by the AIC minimizes the prediction error asymptotically. However, it is known that AIC- 
type criteria do not have the property of consistency; that is, the probability that the criteria 
will select the true model does not converge to 1. On the other hand, the BIC, b ased o n 
the Bayesian marginal likelihood, does exhibit consistency in certain specific models (Nishii, 
19841 ). but does not select models as effectively in terms of their predictive ability. Therefore, 
we propose a hybrid of the AIC and the BIC that uses the empirical Bayesian method, which 
has the property of consistency, but also selects models well in terms of predictive ability. 


Our approach is to measure the prediction risk of a predictive density based on the 
Bayesian marginal likelihood from a frequentist point of view. Specifically, we focus on the 
variable selection problem for normal linear regression models, assuming a prior distribution 
of the regression coefficients in order to derive the criterion. In Section [31 we consider two 
prior distributions, namely the normal distribution and the uniform prior distribution. There 
are three advantages of our method. First, this is a compromise between the frequentist 
and Bayesian standpoints because it evaluates the frequentist’s risk of the Bayesian model. 
Thus, the method should be less influenced by a prior misspecification. Second, the criteria 
exhibits consistency when selecting the true model. At the same time, our proposed criteria 
can select a good model, in the sense that the prediction risk is small, because the criteria 
are based on the KL divergence. Lastly, a non-informative improper prior can be also used 
to construct criteria using our approach. If we consider the Bayesian risk, which is the 
KL risk, integrated out with respect to the parameters based on the prior distribution, it 
diverges when the prior is improper. On the other hand, when we assume a uniform improper 
prior for the regression coefficients in the normal linear regression model, we can formally 
derive the marginal likelih ood. In this case, t he resulting marginal likelihood is the so-called 
residual likelihood flPatterson and Thompson! . 197l|) . and th e proposed in f ormat ion criterion 
is equivalent to the residual information criterion (RIC) of Shi and Tsai| (120021) . Thus, our 
approach can be considered a theoretical justification of the RIC. 


The rest of the paper is organized as follows. In Section [21 we provide a unified framework, 
which we use to derive the proposed information criteria, that can produce various informa¬ 
tion criteria, including the AIC, BIC, and RIC. Then, we propose a new approach that uses 
the Bayesian marginal likelihood in the general framework, and compare various information 
criteria that are based on a Bayesian model. In Section [31 we derive our information criteria 
for the variable selection problem in a normal linear regression model, assuming a prior dis¬ 
tribution of the regression coefficients. In this section, we also prove the consistency of the 
criteria. In Section [2 we use simulations to verify the numerical performance of the proposed 
criteria. Lastly, Section [5] concludes the paper. 
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2 General 


2.1 Conventional information criteria 


In this section, we describe the concept of information criteria from a general point of view. 
Let y be an n- variate observable random vector, with density f(y |u>) for a vector of unknown 
parameters u;. Let f(y ; y) be a predictive density of f(y\u:), where y is an n-variate inde¬ 
pendent replication of y. Here, we evaluate the predictive performance of f(y',y ) in terms 
of the following risk: 


R(uJ) 



f(y M 

/(y; y) 


f(y\u>)dy 


f(y\uj)dy. 


( 1 ) 


Because this is interpreted as a risk with respect to the KL divergence, we call it the KL risk. 
The spirit of AIC suggests that we can provide an information criterion for model selection 
as an (asymptotically) unbiased estimator of the information, as follows: 


!(<*>;/) = JJ -2log{f(y,y)}f(y\u)f(y\u)dydy 


= E„ 


-2 log {f(y,y)} , 


( 2 ) 


which is part of (pQ) (multiplied by 2), where denotes the expectation, with respect to the 
distribution, of f{y,y\u>) = f {y\u) f (y\u)^ Let A = J(u>;/) - E u [-2\og{f(y,y)}]. Then, 
the AIC variant based on the predictor f(y\ y) is defined by 


!C (/) = -2 log {/(y; y)} + A, 


(3) 


where A is an (asymptotically) unbiased estimator of A and f(y ; y) is the value of the 
function f(y ; y) evaluated at y — y. 

Note that IC(/) produces the AIC and BIC for specific predictive densities. 


(AIC) Use f(y;y) = f{y |w) as the maximum likelihood estimator Q of lo. Then, 


IC(f(ty|o))) is the exact AIC, or is the corrected AIC suggested bv ISugiural (19781) and 


Hurvich and Tsai ( 1989h . which is approximated by the AIC of Akaike ( 1973 


-2 \og{f(y\u)} + 2 dim (a;). 


1974) 


as 


(BIC) Use f(y; y) = f no (y ) = / /(y|a;)7r 0 (a;)du? as a proper prior distribution 7r 0 (u;). 
It is evident that /(u>; f no ) = E UJ [—2\og{f no (y)}]. Thus, we have A = 0, so that lC(f no ) = 
—2log {fn 0 (y)}, which is the Bayesian marginal likelihood. Note that —2\og{f no (y)} is ap¬ 
proximated by BIC = — 2log{/(y|u))} + log(n) • dim(u>). 


2.2 Proposed approach 

The criterion IC(/) in ([3]) can produce not only the conventional AIC and BIC, but also 
various other criteria. Hereafter, we consider that u) is divided as uj = (f3 for a p- 
dimensional parameter vector of interest (3, and a q- dimensional nuisance parameter vector 
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G, respectively. We assume that j3 has prior density 7r(/3|A, G), with hyperparameter A. The 
model is given as follows: 


y 1/3 ~ f{y\P,o), 

f3 ~ 7t(/3|A, G), 

where G and A are estimated from the data. An inference based on such a model is called an 
empirical Bayes procedure. Here, we consider the predictive density f(y;y) as 


/(»;») = U(y\\e) = /'/(g|/3,§M/3|A,§)d/3 


for some estimators, A and G. Then, the information given in (J2]) is 


I(w,U)= ff -‘ 2l og{f n {y\X,G)}f(y\f3,G)f(y\(3,G)dydy, (4) 

and the resulting information criterion is 

IC (A) = -2\og{f n (y\\G)} + A /w , (5) 

where Af n is an (asymptotically) unbiased estimator of = /(u>; f n )—E w [- 2 log{f n (y\ A, 0)}]. 

There are three motivations for considering the information J(cc>; f n ) in (J4|) and the infor¬ 
mation criterion lC(f n ) in ((S]). 

First, the precision of the Bayesian predictor f n (y\X, G) is characterized by the risk 
R(w, f n ) in (JTJ) , which is based on a frequentist point of view. On the other hand, the 
Bayesian risk is defined by 

r O;/) = J #(w;/)7r(/3|A,0)d/3, (6) 


which measures the prediction error of f(y‘,y), under the assumption that the prior infor¬ 
mation's correct, where = (A*, 0*)*. The resulting Bayesian criteria, such as the PIC 
(IKitagawal. 119971) or the DIC (jSpiegelhalter et all 120021) . are sensitive to a prior misspecihca- 
tion because they depend on the prior information. However, because R(uj; f n ) can measure 
the prediction error of the Bayesian model from a frequentist standpoint, the resulting crite¬ 
rion lC(f n ) is less influenced by a prior misspecihcation. 


Second, this criterion has the property of consistency. In Section [3l we derive criteria for 
the variable selection problem in normal linear regression models, and prove that the criteria 
select the true model with probability tending to one. The BIC and marginal likelihood 
are known to exhibit consistency, while most AlC-type criteria are not consistent. However, 
AlC-type criteria can choose a good model in the sense of minimizing the prediction error. 
Our proposed criterion should include both properties, namely consistency when selecting the 
parameters of interest /3, and the tendency to select a good model in terms of its predictive 
ability. 
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Lastly, we can construct the information criterion IC(/ T ) even when the prior distribution 
of (3 is improper, because the information /(cj; f n ) in (j3J) can be defined formally for the cor¬ 
responding improper marginal likelihood. However, because the Bayesian risk r(xp\ f n ) does 
not exist for the improper prior, we cannot obtain the corresponding Bayesian criteria or use 
the Bayesian risk. Note that our criterion is equivalent to the residual information criterion 
(RIC) of Shi and Tsail ( 2002 1 if we assume a uniform prior on the regression coefficients. In 
general, the marginal likelihood based on an improper prior depends on an arbitrary scalar 
constant, which can be included in the prior, but that might be problematic when selecting 
the model. However, the criterion based on our approach, using a uniform prior, can work 
as a variable selection criterion. This is discussed further in Remark Q] in Section [3J 


2.3 Other information criteria based on a Bayesian model 

To clarify our proposed approach, we first explain related information criteria that are based 
on a Bayesian model. When the prior distribution 7r(/3| A, 6) is proper, we can treat the 
Bayesian prediction risk r(xp] /) in ((HD. When xp = (A*, 6*)* is known, the predictive den¬ 
sity f(y ; y) that minimizes r(xp] /) is the Bayesian predictive density (posterior predictive 
density) fp{y\y,xp), given by 




f f(y\P,e)T r (t3\\,e)dl3 


When xp is unknown, we can consider the Bayesian risk of the plug-in predictive den¬ 
sity f*(y \y, xp) . In this case, the resulting criterion is known as the predictive likelihood 
( Akaik e. 198 0al) or the P IC (IKitagawal. 119971) . The deviance information criterion (D IC) of 


Spiegelhalte r et al. 020021 ) and the Bayesian predictive information criterion (BPIC) of Ando 


( 2007 ) are related criteria based on the Bayesian prediction risk r(xp; /). 

Akaike’s Bayesian information criterion (ABIC) (Akaike, 1980b) is another information 
criterion based on the Bayesian marginal likelihood, given by 


ABIC = -2 log{f n (y\ A)} + 2 dim(A), 


where the nuisance parameter 6 is not considered. The ABIC measures the following KL 
risk: 


log 


fM*) 

fn(y\X) 


f n (y\\)dy 


f n (y\\)dy, 


which is not the same as either f) or r(xp] /). The ABIC is used to choose the hyperpa¬ 
rameter A in the same manner as the AIC. However, note that the ABIC works as a model 
selection criterion for /3 because it is based on the Bayesian marginal likelihood. 
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3 Application to Linear Regression Models 

3.1 Criteria 


In this section, we derive variable selection criteria for normal linear regression models. First, 
we consider a collection of candidate models, defined as follows. Let the n x p u matrix X u 
consist of all explanatory variables, and assume that rank ( X u ) = p w . In order to define 
candidate models using the index set j, suppose that j denotes a subset of cu — {1,. .. ,p u } 
containing pj elements (he., p 3 = #(j)) and that Xj consists of pj columns of X u indexed 
by the elements of j. We define the class of candidate models as J = V(uj), namely the 
power set of oo, where u> denotes the full model. We assume that the true model exists in 
the class of the candidate models J , which is denoted by j*- Note that the dimension of the 
true models is pj t , which we abbreviate to p*. 

The candidate model j is the linear regression model 

y = x iPj + 00 

where y is an n x 1 observation vector of the response variables, Xj is an n x pj matrix 
of the explanatory variables, f3j is a pj x 1 vector of the regression coefficients, and e is an 
n X 1 vector of the random errors. Here, e has the distribution A/" n (0, cr 2 V), where a 2 is an 
unknown scalar and V is a known positive definite matrix. 

We consider the problem of selecting the explanatory variables, and assume that the true 
model can be expressed by each candidate model. This is the common assumption used to 
derive an information criterion. Under this assumption, the true mean of y can be written 
as 

E(y) = Xj(3*, 

where (3* is a pj x 1 vector, the pj — p* components of which are exactly 0 , and the remaining 
components are not 0 . Hereafter, we omit the model index, j, for notational convenience. 
Furthermore, we abbreviate (3* as f3. 

Now, we construct the variable selection criteria for the regression model (j7|l . which has 
the form (J5]). We consider the following two situations. 


[i] A normal prior for (3. We first assume a normal prior distribution for /3, 

7t(/3|ct 2 ) ~ A/"(0, <j 2 W), 

where W is a p x p matrix, suitably chosen with full rank. Examples of W are W = 
( XX t X )~ 1 for A > 0, when V is the identity matrix, as introduced by Zellner (119861 1. or 


more simply, W = For the moment, we assume that A is known. We discuss how 

to determine it in Section lT2l Because the likelihood is f(y\/3, a 2 ) ~ Af(X/3,a 2 V), the 
marginal likelihood function is 

fAvW 2 ) = [ f{y\P,° 2 )n((3\a 2 )d(3 


= (2tt a 2 )~ n/2 ■ |V|~ 1/2 ' | WXW- l X + I p 


1 - 1/2 


exp {-y*Ay/(2a 2 )} , 
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where A = V ^ - V^X^XW^X + iry^'V 1 . Note that A = (V + B)~ l for 
B = XWX*] that is f n (y\a 2 ) ~ A/"(0, a 2 (V + B)). Then, we take the predictive density as 
f(y; y ) = fn{y\o' 2 ), and the information (j3| can be written as 

I«, i(w) = E u [n log(27r<j 2 ) + log |V| + log | WX^X + I p \ + yAy/a 2 ] , (8) 

where a 2 = yfPyfn, P = V -1 — V~ l X(X t V~ 1 X)~ 1 X t V~ l , and E u denotes the expecta¬ 
tion with respect to the distribution of f(y, y\/3, a 2 ) = f{y\/3, a 2 )f(y\/3, a 2 ) for u> = (/3*, a 2 )*. 
Note that /3 is the parameter of interest, and cr 2 is the nuisance parameter corresponding to 
6 in the previous section. Then, we propose the following information criterion. 

Proposition 1 The information (co) in (J8)) is unbiasedly estimated by the information 
criterion 

On 

IC.p = -21og{/ 7r ( 2/ |a- 2 )} +--, (9) 

n — p — 2 

where 


-2 log{f n (y\a 2 )} = n log(27r<T 2 ) + log \V | + log | WX'V^X + I p \ + y'Ay/a 2 ; 


that is, E u ( IC^i) = (u>). 

If n~ l w 1/2 x t v~ l xw l/2 converges to a p x p positive definite matrix as n — y oo, 
log | WX t V~ 1 X + I P \ can be approximated as p log n, which is the penalty term of the BIC. 
In that case, ICAj is approximately expressed as 

IC* ;1 = n log(27rd 2 ) + log |V| + p log n + 2 + y t Ay/a 2 


when n is large. 

Note that only the first term of IC^i can work as a variable selection criterion because 
firiyW 2 ) is the Bayesian marginal likelihood. The difference between them is 


2 n 

n — p — 2 


2+ 2(p+2) 

n 


+ 0(n 2 ). 


In other words, IC^i has a slight additional penalty, of order n 1 . We compare the perfor¬ 
mance of the criteria using simulations in Section [4j 

Alternatively, the KL risk r(i/>; /) in (J6J) can be used to evaluate the risk of the predictive 
density f n (y\d 2 ) because the prior distribution is proper. Then, the resulting criterion is 


IC ,,2 = nlog(27rd 2 ) + log |V| +plogn + p, 


( 10 ) 


which is an asymptotically unbiased estimator of 1^,2 (o' 2 ) = E n [I 7 r) i(u>)], where E n denotes 
the expectation with respect to the prior distribution 7r(/3|cr 2 ); that is IC^) IttA® 2 ) 

as n —> oo. Interestingly, IC^- i2 is analogous to the criterion proposed by Bozdogan ( 19871 ). 
known as the consistent AIC, who suggested replacing the penalty term 2 p in the AIC with 
p + p log n. 
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[ii] Uniform prior for (3. We next assume a uniform prior for /3. namely (3 ~ uni f orm{W). 
Although this is an improper prior distribution, we can obtain the marginal likelihood func¬ 
tion formally, as follows: 


fr(yW 2 ) = I f(y\(3,a 2 )df3 

= (2vra 2 )- (ri - p)/2 • |U|~ 1/2 • • exp {-y t Py/(2a 2 )} , 


which is known as the residual likelihood ( Patterson and Thompson . 197l[) . Then, we take 
the predictive density as f(y ; y) = f r (y |cr 2 ), and the information (d]) can be written as 


Ir M = E u [(n - p ) log(27Td 2 ) + log I V| + log IXV-'XI + tfPy/d 2 } , (11) 

where a 2 = y t Py/(n —p), which is the residual maximum likelihood (REML) estimator of 
a 2 , based on the residual likelihood f r (y\a 2 ). Next, we propose the information criterion. 


Proposition 2 The information J r (m) in CD is unbiasedly estimated by the infomation 
criterion 

IC, = -2 log{f r (y\d 2 )} + (12) 

Tt p Zj 

where 


— 2 log{f r (y\a )} = (n - p) log(27r<r) + log|V| + log \X V X\+y Py/a ; 


that is, EufLCr) = / r (m). 

Note that y t Py/a 2 — n — p. If n~ x X t V~ l X converges to a p x p positive definite matrix 
as n —» oo , log|X t U _1 X| can be approximated by plogn. Then, we can approximate the 
criterion as 

(n — p) 2 


IC* = (n — p) log(27r a 2 ) + log | V | + p log n + 


n — p 


2 ’ 


(13) 


for large n. Note that IC* is equivalent to the RIC proposed by [Shi and Tsai ( 20021 ). Noting 
that (n — p) 2 /(n — p — 2) = (n + 2) + {4 /(n — p — 2) — p}, we can see that the difference 
between IC* and the RIC is n + 2 — plog(27rd 2 ); that is IC* = RIC + n + 2 — plog(27r<r 2 ). 
Note too that the criterion based on f r (y\a 2 ) and r(xf\ f r ) cannot be constructed because its 
KL risk diverges to infinity. 


Remark 1 As discussed in Section 12.21 the marginal likelihood based on an improper prior 
depends on an arbitrary scalar constant, which can, in general, be problematic when selecting 
the model. However, our IC r , or its equivalent RIC, can work as a variable selection criterion. 
In order to show that, we compare IC r with the AIC and BIC. When V = I n , the AIC and 
BIC for the normal linear regression model can be expressed as 


IC = ?zlog(27r) + ?rlog(?z 1 RSS) + n + g(p ), 















where the first three terms are the likelihood part, and the last term g(p ) is the penalty, 
which depends on p. Here, g(p) = 2 (p + 1 ) for the AIC and g(p) = plog(n) for the BIC. 
Then, RSS is the residual sum of squares, defined as RSS = \\y — X/3\\ 2 = na 2 . On the other 
hand, IC r in (fT 2 j) can be rewritten as 

IC r = n log(27r) + n log(n^ 1 RSS) + n + h(p ), 

where the first three terms are the same as those of the AIC and BIC, and h(p) is 

h(p) = p{\og(n — p) — log(n _ 1 RSS) — log(27r) — 1} + log | X*X\ — plog(n) 

+ n\og{n/(n — p)} + 2 + 0 (n _1 ). 

Thus, h(p) can represent the penalty for the large model because log|X*X| — plog(n) is 
asymptotically negligible, and the value in the braces of the first term is positive when n is 
at least moderately large, noting that n _1 RSS becomes small as p becomes large. 


3.2 Typical examples of the linear regression models 


In the derivation of the criteria, we assumed that the scaled covariance matrix V of the 
vector of error terms is known. However, it is often the case that V is unknown, and is 
some function of the unknown parameter cj ), namely V — V((f)). In that case, V in each 
criterion is replaced with its plug-in estimator V(<f>), where (j) is some cons istent estimator 
of <p. This strategy is also used in many other studies, for example in Shi and Tsai ( 120021 ). 
who proposed the RIC. We suggest that the <p be estimated based on the full model. The 
scaled covariance matrix W of the prior distribution of (3 is also assumed to be known. In 
practice, its structure should be specified, and we have to estimate the parameter A in W 
from the data. In the same manner as V, W in each criterion is replaced with W(A). Note 
that A should be estimated based on each candidate model under consideration, because the 
structure of W depends on the model. We propose that A is estimated by maximizing the 
marginal likelihood f n (y\a 2 , \), after substituting in the estimate a 2 . 


Here, we give three examples for the regression model (j7|) : a regression model with con¬ 
stant variance, a variance components model, and a regression model with ARMA errors. 
The second and the third models include the unknown parameter in the covariance matrix. 


[1] Regression model with constant variance. When V = I n , © represents a multiple 

regression model with constant variance. In this model, the scaled covariance matrix V does 
not contain any unknown parameters. 


[2] Variance components model. 

1950f ). described as follows: 


Consider a variance components model (IHendersonl . 


y — X (3 + Z 2 V 2 + • • • + Z r v r + 77 , (14) 

where Z t is an nxrrii matrix with V* = Z^Z (, Vi is an m, x 1 random vector with distribution 
Af mi ( 0, 9il mi ) for i > 2 , 77 is an n x 1 random vector with 77 ~ J\T n ( 0, To + 9 Wi) for known 
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n x n matrices V$ and V \, and rj, v 2 ,..., v r are mutually independently distributed. The 
nested error regression model (NERM) is a special case of a variance components model, 
given by 

Uik = x\ k P + Vi + rj ik , (i = 1,..., m; k = 1,..., m), (15) 


where v t and rj l k are mutually independently distributed as v t ~ Af( 0, r 2 ) and rfck ~ Af(0, a 2 ), 
respectively, and n = YlT=\ n i- Note that the NERM in (fT5l) is given by 6\ = a 2 , 0 2 = 
r 2 , Vi = I n and Z 2 = diag (l ni ,..., l nm ), where 1; is the /-dimensional vector of ones, for 
the variance components model f|14p . This model is often used for clustered data, where u* 
is considered the random effect of the cluster ( iBattese et ah, 19881 ). For such a model, when 
we are interested in a specific cluster or in pr edicting the random ef fects, an appropriate 
criterion is the conditional AIC, as proposed by Vaida and Blanchardl ( 2005 ). which is based 
on the conditional likelihood given the random effects. However, when we wish to predict 
the fixed effects, namely x\ k (3, the NERM can be seen as a linear regression model and 
the random effects are part of the error term. In other words, we consider e = Z 2 v 2 + rp 
V = V{4>) = <j>V 2 + In for dZD, where </> = r 2 /a 2 and V 2 = Z 2 Z\ = diag (J ni ,..., J Um ) for 


Ji = 1 / 1 /. In this case, our proposed variable selection procedure is useful. 


[3] Regression model with autoregressive moving average errors. Consider the 
regression model (I7j) , assuming the random errors are generated by an ARMA(g, r) process 
defined by 

£i ' (jlq £i—q Wj ' ^r^i—ri 

where {«*} is a sequence of independent normal random variables, with mean 0 and variance 
t 2 . A special case of this model is the regression model with AR(1) errors, satisfying gq ~ 
AT(0,t 2 /(1 — 4> 2 )), £i = 4>£i~ i + Ui , and iq ~ A/"(0,r 2 ) for i = 2,3,... ,n. When we define 
a 2 = r 2 /( 1 — 0 2 ), the (i,.))-element of the scaled covariance matrix V in (JTj) is (jr l ~^. 


3.3 Consistency of the criteria 


In this subsection, we prove that the proposed criteria exhibit consistency. Our asymptotic 
framework is that n tends to infinity and the true dimension of the regression coefficients p* 
is fixed. Following Shi and Tsai ( 2002h . we first show that the criteria are consistent for the 
regression model with constant variance and pre-specihed W. Then, we extend the result to 
the regression model with a general covariance matrix and the case where W is estimated. 


We divide J into two subsets, J + and where J + — {j £ J : j* C j} and J_ = J\J+. 
Note that the true model j* is the smallest model in J + , and that E(y) = X[3 ]t , abbreviated 
to X^/3*, where /3* is a p* x 1 vector of the true re gressio n coefficients. Let j denote the 
model selected by some criterion. Following Shi and Tsai ( 2002fh we make the following 
assumptions: 


(Al) E{e\) < oo. 

(A2) 0 < liminf min \X\Xjn\ and limsupmax \X\XJn\ < oo. 

1 , woo jej ^ 

(A3) liminfn" 1 inf || X*(3, - | 2 > 0, where //, = XAX^X^X). 

n—>oo j£j- J J 
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We can now obtain the asymptotic properties of the criteria for the regression model with 
constant variance. 

Theorem 1 If assumptions (Al)-(A3) are satisfied, J+ is not empty, the £i’s are indepen¬ 
dent and identically distributed (iid), and Wj in the prior distribution of /3 - is pre-specified, 
then the criteria IC^i, IC* 1; ICR^, IC r , and IC* are consistent; that is P(j = j*) —* 1 as 
n —> oo. 

The proof of Theorem [Tj is given in Appendix [HI 

We next consider the regression model with a general covariance structure and the case 
where Wj is estimated from the data. In this case, V and Wj are replaced with their plug-in 
estimators V(<fi) and Wj(Xj), respectively. 

Theorem 2 Assume that (fi — (fi 0 and A 3 - — A^o tend to 0 in probability as n —> oo, for all 
j G J. In addition, assume that the elements ofV(<fi) and Wj(Xj) are continuous functions 
of (fi and Xj, respectively, and that V((fi) and Wj(Xj) are positive definite in the neighborhood 
of (fi Q and Xjfi, respectively, for all j G J. If assumptions (Al)-(A3) are satisfied when Xj 
and e are replaced with V~^ 2 Xj and e* = respectively, J + is not empty and e* are 

iid. Then, the criteria IC^i, IC* l7 IC^, IC r , and IC* are consistent. 

For the proof of Theorem [21 we use the same techniques as those used in the proof of 
Theorem [0 


4 Simulations 

In this section, we compare the numerical performance of the proposed criteria, IC^i and 
IC r , with that of conventional criteria, namely the AIC, BIC, DIC, and the marginal like¬ 
lihood. We consider two regression models: a regression model with constant variance and 
a regression model with AR(1) errors. These models are taken as examples of the linear 
model (J7[) given in Section 12721 In each simulation, 1000 realizations are generated from ([71). 
with (3 = (1,1,1,1, 0, 0, 0)* or (3 = (1,1, 0, 0, 0, 0, 0, )*. That is, the full model is p u = 7 and 
the true model is p* = 4 or p* = 2. All explanatory variables are randomly generated from 
the standard normal distribution. The signal-to-noise ratio (SNR = (var(a3*/3)/var(ej)} 1 ^ 2 ) 
is controlled at 1, 3, and 5. For the regression model with AR(1) errors, we set the AR 
parameter as (fi = 0.5. 

When deriving the criterion IC^i, we set the prior distribution of (3 as J\f p (Q,a 2 X~ l I p )-, 
that is, W = X~ 1 I P . The unknown parameter (fi in V for the AR(1) model is estimated 
using the maximum likelihood estimator based on the full model. The hyperparameter A 
is estimated by maximizing the marginal likelihood f n (y \d 2 , A), after substituting in the 
estimate a 2 = y^Py/n of a 2 . Note that (fi is estimated based on the full model, while a 2 and 
A are estimated from each candidate model using the plugged-in version of V(cfi). 
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As a competitor for IC^i, we consider the criterion that uses f 7T (y\d' 2 ) only, which is 
the so-called the marginal likelihood commonly used in Bayesian analyses. Another com¬ 
petitor is the DIC, which is also popular in Bayesian analyses. When deriving the DIC, we 
consider the same prior distribution of (3 as that assumed when deriving IC^i, namely (3 ~ 
Af p (0, a 2 X~ 1 I p ). In fairness to the other criteria, we take a 2 as an unknown parameter and do 
not assume a prior distribution for the derivation of the DIC. Let D((3) = —2 log{/(y|/3, cr 2 )}. 
When a 2 is known, the DIC is 

DIC(cr 2 ) = 2Ep\ y [D((3)] — D(f3), 

where Ep\ y denotes the expectation with respect to the conditional distribution of (3, given 
y and p = E^ y (/3). Because (3\y ~ J\f p 0, + W 1 )" 1 ) for p = (XW^X + 

W~ 1 )~ 1 X t V~ 1 y, the first term of the DIC is 

2 Ep\ y E[D(l3)] = 2tr + 0 }\ /a 2 - AyW^Xp/a 2 

+ (the term which is irrelevant to the model), 


and the second term is D(f3) = (y — XP)*V x (y — X/3)/a 2 + (the term that is irrelevant to 
the model). Then, we use DIC(d 2 ), where a 2 is substituted into DIC (cr 2 ). 

We consider all subsets of the full model as the class of the candidate models. The 
performance of the criteria is measured as the number of simulations that select the true 
model and the prediction error of the selected model, based on the quadratic loss; that is 
|| XjjP- — X*/3*|| 2 /n, where (3 ) = (X f -W(0)^ 1 Xj)^ 1 WjW(0)~ 1 7/ (i.e., the GLS estimator). 

First, we confirm that IC^i, IC r , and the BIC are consistent. Figure |T] shows the number 
of simulations that select the true model among 1000 realizations of the regression model 
with constant variance. The resuls show that each criterion is consistent. When the data are 
noisy (i.e., the SNR is weak), IC r does not perform as well as IC^i and the BIC in terms of 
selecting the true model. Although we omit the detail, the results for the AR(1) model are 
similar to those of the model with constant variance. 


Next, we investigate the performance of the criteria in terms of the prediction error. Tables 
mm show the mean value of the prediction error among 1000 simulations. In each case, we 
put an asterisk at the minimum value of the prediction error. In many cases, IC^i performs 
best, except when the sample size is small and the SNR is weak. Although the performance 
of the marginal likelihood (ML) is similar to IC^i, that of IC^i is slightly better. For the 
noisy and small sample size data, which is the exception to the good performance of IC^i, 
IC r performs very well, except for the regression with constant variance with p* = 2, shown 
in Table [2j In this case, the result in Figure [D show that IC r can not select the true model. 
In this case, IC r tends to select larger models. Shi and Tsail (120028) also pointed out that the 
RIC, which is equivalent to IC r , does not perform well when the sample size is small and the 
SNR is weak. On the other hand, IC r performs best in terms of the prediction. 
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Figure 1: The number of simulations that select the true model by the criteria in 1000 
realizations of the regression model with constant variance. The left three figures are the 
result for p* = 2, and the right three figures are for p * = 4. 
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Table 1: The mean of the prediction error among 1000 simulations for the regression model 
with constant variance. The true model is p* = 4. 




SNR 

ia,i 

IC r 

AIC 

BIC 

DIG 

ML 



1 

1.67 

*1.41 

1.56 

1.71 

1.42 

1.63 

n = 

20 

3 

*0.105 

0.130 

0.132 

0.122 

0.152 

0.106 



5 

*0.0373 

0.0436 

0.0492 

0.0456 

0.0562 

0.0377 



1 

0.721 

*0.698 

0.699 

0.850 

0.702 

0.713 

n = 

40 

3 

*0.0516 

0.0598 

0.0643 

0.0560 

0.0764 

0.0520 



5 

*0.0167 

0.0189 

0.0224 

0.0194 

0.0268 

0.0168 



1 

*0.274 

0.326 

0.299 

0.275 

0.348 

0.275 

n = 

80 

3 

*0.0241 

0.0274 

0.0314 

0.0254 

0.0380 

0.0241 



5 

*0.00877 

0.00972 

0.0120 

0.00984 

0.0144 

0.00878 


Table 2: The mean of the prediction error among 1000 simulations for the regression model 
with constant variance. The true model is p* = 2. 




SNR 

ia,i 

ICV 

AIC 

BIC 

DIG 

ML 



1 

0.524 

0.687 

0.604 

0.561 

0.696 

*0.522 

n = 

20 

3 

*0.0337 

0.0486 

0.0577 

0.0484 

0.0757 

0.0347 



5 

*0.0116 

0.0151 

0.0219 

0.0185 

0.0281 

0.0117 



1 

*0.207 

0.304 

0.261 

0.201 

0.342 

0.211 

n = 

40 

3 

*0.0155 

0.0206 

0.0279 

0.0203 

0.0382 

0.0157 



5 

*0.00476 

0.00590 

0.00969 

0.00699 

0.0134 

0.00477 



1 

*0.0920 

0.131 

0.1268 

0.0831 

0.170 

0.0926 

n = 

80 

3 

*0.00683 

0.00896 

0.01348 

0.00853 

0.01899 

*0.00683 



5 

*0.00242 

0.00296 

0.00516 

0.00333 

0.00720 

0.00242 


Table 3: The mean of the prediction error among 1000 simulations for AR(1) with cj) = 0.5. 
The true model is p* = 4. 




SNR 

ia,i 

IC r 

AIC 

BIC 

DIC 

ML 



1 

0.341 

*0.296 

0.374 

0.449 

0.360 

0.334 

n 

= 40 

3 

*0.0223 

0.0251 

0.0312 

0.0268 

0.0375 

0.0224 



5 

*0.00729 

0.00805 

0.0108 

0.00895 

0.0133 

0.00730 



1 

*0.179 

0.184 

0.206 

0.252 

0.231 

0.179 

n 

= 60 

3 

*0.0146 

0.0160 

0.0199 

0.0163 

0.0243 

*0.0146 



5 

*0.00494 

0.00527 

0.00722 

0.00583 

0.00885 

0.00496 



1 

*0.119 

0.131 

0.143 

0.154 

0.174 

0.119 

n 

= 80 

3 

*0.0109 

0.0118 

0.0151 

0.0121 

0.0186 

*0.0109 



5 

*0.00390 

0.00412 

0.00551 

0.00444 

0.00673 

*0.00390 
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Table 4: The mean of the prediction error among 1000 simulations for AR(1) with (j) = 0.5. 
The true model is p* = 2. 




SNR 

IC W) i 

IC r 

AIC 

BIC 

DIG 

ML 



1 

0.341 

*0.296 

0.374 

0.449 

0.360 

0.334 

n 

= 40 

3 

*0.0223 

0.0251 

0.0312 

0.0268 

0.0375 

0.0224 



5 

*0.00729 

0.00805 

0.01079 

0.00895 

0.0133 

0.00730 



1 

*0.179 

0.184 

0.206 

0.252 

0.231 

0.179 

n 

= 60 

3 

*0.0146 

0.0160 

0.0199 

0.0163 

0.0243 

*0.0146 



5 

*0.00494 

0.00527 

0.00722 

0.00583 

0.00885 

0.00496 



1 

*0.119 

0.131 

0.143 

0.154 

0.174 

0.119 

n 

= 80 

3 

*0.0109 

0.0118 

0.0151 

0.0121 

0.0186 

*0.0109 



5 

*0.00390 

0.00412 

0.00551 

0.00444 

0.00673 

*0.00390 


5 Concluding Remarks 

We have derived variable selection criteria for normal linear regression models relative to 
the frequentist KL risk of the predictive density, based on the Bayesian marginal likelihood. 
We have proved the consistency of the criteria and, using simulations, have shown that they 
perform well in terms of the prediction. 

Although our theoretical approach is general, the derivation of the criterion depends on the 
normal distribution. If we assume a conjugate prior distribution for the parameter of interest 
when deriving the criterion, it is easy to extend our approach to other models. However, for 
the class of generalized linear models, which includes the Poisson and the logistic regression 
models, it is difficult to consider a prior distribution where the marginal likelihood can be 
evaluated analytically. In such models, we have to rely on some computational method, 
which we leave for future research. 

Variable selection for the mixed effects models, such as the variance components model 
(ITT]) in Section IHT1 is another important problem. As discussed, it is appropriate to consider 
the KL divergence based on the conditional density, given the random effects, when the 
objective is to predict the random effects, as in the conditional AIC (cAIC). An extension of 
our approach to the cAIC-type criterion is also left to future research. 
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A Derivations of the Criteria 


In this section, we show the derivations of the crit eria. To this end, we firs t obta in the 
following lemma, which was shown in Section A.2 of ISr i vastava and Kubokaw a (12010 1. 


Lemma 1 Assume that C is an n x n symmetric matrix, M is an idempotent matrix of 
rank p, and that u ~ J\f( 0, I n ). Then, 

tr(C) _ 2tr [C(I n - M)\ 
n — p — 2 (n — p)(n — p — 2) 


E 


u'Cu 


uHI n 


M)u 


A.l Derivation of IC^i in ([9]) 

It is sufficient to show that the bias correction = I n i(u>) — E u ,[—2log{f n (y\a 2 )}) is 
2 n/(n — p — 2), where I n i(u>) is given by (J8])• It follows that 

Ai = Euffi Ay/d 2 ) - E u ,(y t Ay / a 2 ) 

= E^ify Ay) ■ E^l/d 2 ) - E^y 1 Ay / dr 2 ). 


First, 

EUtfAy) = E u [(y -X(3 + X/3fA(y - Xf3 + X/3)} 
= a 2 tr (. AV) + fiX* AX(3. 

Second, noting that no 2 = y l Py = cr 2 u t (I n — M)u for 

u = V- 1/2 (y - X0)/a, 

M = I n — V~ 1/2 X(X t V- 1 X)- 1 X t V~ 1/2 , 


and that PX = 0, we obtain 

E^il/a 2 ) = nE u 


ytpy 

n 


= nE u 


<7 2 u t (J n — M)u 


Finally, 


E^y'Ay/d 2 ) = nE u 


a 2 (n — p — 2) 

y t Ay\ 


= nEu 


a 2 u t V 1/2 AV 1/2 u + /3 t X t AX/3 
a 2 u t (I n — M)u 


— n x 


y t Py J 

tr (AV) 2tr (AVPV) (3 t X t AX(3 


n — p — 2 {n — p){n — p — 2) a 2 (n—p — 2) 

The latter equation derives from Lemma [0 Combining (TT6|) . (jT8|) . and (TT9|) . we have 

2n ■ tr (AVPV) 


A^l — 


{n — p){n — p — 2) 


(16) 


(17) 


(18) 


( 19 ) 
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Therefore, 


tr ( AVPV ) = tr{(V + B)~\V + B- B)PV} 

= tr (PV) - tr {(V + B) 1 BPV} 

= tr (J n - M) — n — p, (20) 

because BP = XWX f P = 0. Then, we obtain A^.i = 2 n/(n — p — 2). □ 


A.2 Derivation of IC^ in (1 101) 

From the fact that ^(IC^i) = I n .\(uj) and that ^^(IC^i) = = I 7T , 2 {o' 2 ), it 

suffices to show that is approximated by 

E n Eu( IC^i) « E 7T E CJ [nlog(2TTa 2 ) + log|V| + p log n + 2 + y t Ay/a 2 } 

~ E n E LJ [nlog(2TTa 2 ) + log |V| + plogn + p\ + (n + 2) = + (n + 2), 

when n is large. Note that n + 2 is irrelevant to the model. It follows that 




= n x E u 


y t {V ~ 1 - V~ 1 X(X t V~ 1 X + W r_1 )- 1 X t V r_1 }y 


= n + n x Eu 
n 

= n + 

= n + 


cr 2 (n — p — 2) 
n 


y l {v~ l - r 1 i(i‘r 1 i)- 1 i i r 1 }i/ 

f/'r'ifi'r 1 ! + w r_ 1 )- 1 w r_ 1 (x t v _ 1 x)- 1 A; t v _ 1 y' 
y*{V _1 - r'lfl'r 1 !)- 1 ! 1 !' 1 }!/ _ 

x a 2 -tr{(X'V” 1 X + 


n 2 (n — p — 2) 

+ + w-'y'w- 1 ^ 

and that 


+ w 1 )- 1 w -1 #] = a 2 • tr [x f y~ 1 x(x t y 1 x + w 1 )- 1 ]. 

If n~ 1 X t V~ 1 X converges to a p x p positive definite matrix as n —> oo, tr [(X f V _1 X + 
W- 1 )-^- 1 } Oandtr[X t V^ 1 X(XV~ 1 X+TF' 1 )- 1 ] -> p. Then, we have E^E^^Ay/a 2 — 
n ) —y p, which we want to show. □ 


A.3 Derivation of IC r in (1121) 


We show that the bias correction A r = / r (u>) — Eu[—2 log{/ r (y|er 2 )}] is 2{n — p)/(n — p — 2), 
where I r (ui) is given by (fill) . Then, 


A r = Eu(yPy/o 2 ) - E^Py/cr 2 ) 
= Eu(y Py) ■ Eu{l/a 2 ) - ( n-p ). 
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Since E w (y t Py) — (n — p)c r 2 and E^{l/a 2 ) — (n — p)/{(? 2 {n — p — 2)}, we have A r = 
2(n — p)/(n — p — 2). □ 


B Proof of Theorem H 

We only prove the consistency of IC^i. The proof of the consistency of the other criteria can 
be shown in the same manner. Because we have 

for any j G J \ {j*}, it suffices to show that < IC^ijj*)} —* 0, or equivalently, 

that P{IC M (jj — ICyr.i (j* ) > 0} —y 1 as n —> oo for j e J \ { j *}. When V = I n , we obtain 

— IC w ,i(j;) = h + h + h , 


where 


h = n log(d 2 /d 2 ) + y t A j y/ d 2 - t/A*t//d 2 , 
h = log \X)Xj + WJ 1 ! - log 


h 


log{|W j |/|W*|} + 


2 n 

n — Pj — 2 


2 n 

n - p* — 2 ’ 


for d 2 = - iT^/n, d 2 = d 2 t , A, = I n - XpXrx, + A,-. = A*, 

and VF* = We evaluate the asymptotic behaviors of if, J 2 , and J 3 for j e J_, and 

.? e J+ \ {jo}, separately. 

[Case of j e i7_]. First, we evaluate ii. We decompose I\ = In + In, where In = 
nlog(d 2 /d 2 ) and In = t/A^y/d 2 - y t A*y/a 2 . It follows that 


d 2 - d 2 = {X./3, + e)\l n - Hj)(X>p, + e)/n - e\l n - HJe/n 
= \\X*P, - HjX*Pt\\ 2 /n + o p (l). 


Then, we have 


-i r , , || X.P.-HjX.P. H ( , ^ 

n I n = log ( H-—— ) = log <j H- — -} + o p (l), 


cr~ 


na 


( 21 ) 


and it follows from the assumption (A3) that 


lim inf log { 1 + -—^- -J -—[> > 0. 


ncr 


( 22 ) 


Because y t A^y /(nd 2 ) = 1 + o p (l) and y t A^y/(na 2 ) = 1 + o p (l), we obtain 


n 


-i 


112 ~ 0 p (l). 


(23) 
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Second, we evaluate Io . It follows that 

log | X^Xj + Wj 1 ] = pj logn + log | X^Xj/n + W^ 1 /n\ = p 3 logn + 0(1). 
In addition, log \X\X * + W~ 1 \ = p* logn + 0(1). Then, 

n _1 / 2 = (pj — p*)n^ 1 logn + o(l) = o(l). 

Lastly, it is easy to see that 

n _1 / 3 = o(l). 

From ()2T]) - (|25]h it follows that 

p{ia, 1 (j)-ia,i(j*)>o}^i, 


for all j G J—. 

[Case of j G J+ \ {j*}]. First, we evaluate I\. From 


cu 


dj = e\H 3 


H*)e/n = O p (n x ), 


it follows that 


(logn) On = (logn) -nlog 


- & - aj) 

d? 


= (logn) 1 • n • log{l + O p (n x )} = o p (l). 

For ii2, from (l27j) and y t A 3 y — y t A if y = O p (l), we obtain 

/is = y t A j y/ d| - y t A*y/&l 

= (y t A j y - y t A*y)j d* + O p (l) = O p (l). 


Then, 

(logn) _1 /i2 = o p (l). 

Second, we evaluate J 2 . Since p 3 > p* for all j G J + \ { j*}, 

lim inf(logn) _1 / 2 = pj — p* > 0. 

n—> oo 

Finally, it is easy to see that 

(log n) _1 J 3 = o(l). 

From (l28l) - (l3ll) . it follows that 

P{IC Wi i(j)-IC w ,2(j.)>0}->1, 


for all j e J7+ \ {j.}. 

Combining i'ol and 11331) . we obtain 

P{IC, a (j) - Kb ,!.,,) > 0} -> 1, 
for all , 0 J \ {j,}, which shows that IC,,., is consistent. 


(24) 

(25) 

(26) 

(27) 

(28) 


(29) 

(30) 

(31) 

(32) 

□ 
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